feat: MuJoCo simulation backend - AgentTool with 35 actions #85

Open
cagataycali wants to merge 96 commits into strands-labs:main from cagataycali:feat/mujoco-backend

Conversation

@cagataycali
Member

@cagataycali cagataycali commented Apr 1, 2026

TL;DR

Complete MuJoCo simulation backend for strands-robots, shipped as a Strands AgentTool with 35 actions. An agent can spin up a physics world, load robots + objects, step physics, render RGB/depth cameras, run policies, record LeRobot-format datasets, and perform advanced physics queries — all via natural language through a single tool.

Part 4 of 6 in the MuJoCo-sim PR decomposition (follows #83 build-system, #84 sim foundation).

🧑‍⚖️ Reviewer note — this diff is large (~11.6k / −700 lines, 46 commits) but most of the noise is cosmetic. See How to review this PR below for a file-by-file reading order.


How to review this PR

There's a lot going on. To keep the review tractable, here's what actually matters vs. what's background noise.

✅ 1. Must-read — the new simulation backend

These are the ~3–4k lines of real new functionality. Review in this order:

| # | File | Lines | Purpose |
|---|------|-------|---------|
| 1 | strands_robots/simulation/base.py | 460 | SimEngine ABC: the public contract every backend implements |
| 2 | strands_robots/simulation/factory.py | 229 | create_simulation() + runtime register_backend(); lets third parties plug in new backends |
| 3 | strands_robots/simulation/mujoco/backend.py | 156 | Lazy import mujoco + headless GL auto-config (osmesa/egl detection) |
| 4 | strands_robots/simulation/mujoco/simulation.py | 1,256 | Simulation(AgentTool), the orchestrator. All 35 agent actions live here. Primary review target. |
| 5 | strands_robots/simulation/mujoco/tool_spec.json | 357 | JSON schema for those 35 actions (this is what the LLM sees) |
| 6 | strands_robots/simulation/mujoco/mjcf_builder.py | 215 | Generate MJCF XML from dataclasses (World, Object, Robot) |
| 7 | strands_robots/simulation/mujoco/scene_ops.py | 765 | XML round-trip: inject/eject robots and objects from a live scene |
| 8 | strands_robots/simulation/mujoco/physics.py | 867 | PhysicsMixin: raycasting, jacobians, energy, forces, mass matrix, checkpoints, inverse dynamics. Each method is independent; review by feature, not top-to-bottom. |
| 9 | strands_robots/simulation/mujoco/rendering.py | 563 | RenderingMixin: offscreen RGB + depth cameras, multi-camera capture |
| 10 | strands_robots/simulation/mujoco/recording.py | 173 | RecordingMixin: LeRobot v3 dataset recording (parquet + MP4 per camera) |
| 11 | strands_robots/simulation/policy_runner.py | 553 | PolicyRunnerMixin: async observe→policy→act loop, run_policy, eval_policy, replay_episode |
| 12 | strands_robots/simulation/mujoco/randomization.py | 81 | RandomizationMixin: domain randomization |
| 13 | strands_robots/dataset_recorder.py | 515 | LeRobot v3 writer used by RecordingMixin |

Architecture at a glance:

Simulation(AgentTool)
  ├── PhysicsMixin         # raycasting, jacobians, energy, forces,
  │                        # mass matrix, checkpoints, inverse dynamics
  ├── PolicyRunnerMixin    # run_policy, eval_policy, replay_episode
  ├── RenderingMixin       # RGB/depth offscreen rendering
  ├── RecordingMixin       # LeRobot dataset recording (parquet + MP4)
  └── RandomizationMixin   # domain randomization

🧪 2. Tests — proves the above works

1,030 passing tests (up from ~288 on main). New coverage:

| File | Lines | What it locks in |
|------|-------|------------------|
| tests/simulation/mujoco/test_simulation.py | 1,024 | End-to-end 35-action surface |
| tests/simulation/mujoco/test_concurrency.py | 642 | Thread-safety (scene mutations during policy runs) |
| tests/simulation/test_policy_runner.py | 585 | Runner loop against a FakeSim backend |
| tests/simulation/mujoco/test_physics.py | 361 | All physics APIs (raycast/jacobian/energy/…) |
| tests/simulation/mujoco/test_e2e.py | 314 | "Create world → add robot → step → render → record" flows |
| tests/simulation/mujoco/test_error_paths.py | 298 | Every error branch (invalid args, missing entities, etc.) |
| tests/simulation/mujoco/test_tool_spec.py | 250 | tool_spec.json schema validation + DX contract (public methods match actions) |
| tests/simulation/test_policy_runner_paths.py | 227 | Runner error paths, idempotent stop, concurrent-policy conflict |
| tests/simulation/test_factory.py | 185 | register_backend happy path + conflicts + alias resolution |
| tests/simulation/mujoco/test_mjcf_xml_injection.py | 124 | XML-injection fuzzer (no path traversal / XXE) |
| tests_integ/simulation/test_mujoco_journeys.py | | Real-robot integration journeys |
| tests_integ/simulation/test_multi_robot_tasks.py | 141 | NEW: multi-agent scene composition, per-robot joint-prefixing, multi-camera recording |

Coverage: 53% overall (100% on factory.py, randomization.py; 92% on physics.py; 91% on policy_runner.py; 89% on rendering.py; 86% on simulation.py).

📓 3. Runnable demo — notebooks on a sibling branch

Rather than bloat this PR with output-baked notebooks (>140KB each with embedded
images), they live on the sibling branch pr-85-notebooks.
All three notebooks are committed with their outputs baked in — browse them
on GitHub with rendered images and printed assertions, no local MuJoCo install
needed.

| Notebook | What it proves |
|----------|----------------|
| 01_mujoco_quickstart.ipynb | Learn the sim API: create_world → add_robot → step → render → send_action → start_recording. 2 embedded MP4 videos (front cam + wrist cam) of the arm reaching a commanded pose. |
| 02_vla_inference.ipynb (headline demo) | Load real SmolVLA on Apple MPS, run 60 inference steps @ 20 Hz with the prompt "grasp the green cube". 2 embedded MP4 videos of the actual VLA rollout + parquet action inspection + matplotlib trajectory plot. Cold load ~13s, rollout ~9.5s at ~6.3 Hz effective. |
| 03_multi_robot_vla.ipynb | Two SO-101 arms in one world, each driven by SmolVLA with a different instruction. 3 embedded MP4 videos (top + alice wrist + bob wrist). Proves the new multi-robot joint-prefix feature: observation.state.names = [alice__shoulder_pan, …, bob__shoulder_pan, …], plus a backwards-compat control showing single-robot scenes still get flat names. |

All three executed cleanly with MuJoCo 3.8 / lerobot 0.5.1 / SmolVLM2 on Apple MPS. Zero errors, 7 embedded MP4 videos + 3 matplotlib plots + scene previews baked in — watch them directly on GitHub. See notebooks/README.md for the re-run recipe + hardware notes.

🧹 4. Noise to skim past

About 40% of the line count is not functional and can be skimmed:

  • chore: strip emojis/dividers + fix leading-space artifacts (46 files). Removed decorative emojis (✅❌🔌🤖…) from log + tool-result strings and # ──── / # ---- comment dividers. Also fixed 200+ f" {msg}" → f"{msg}" artifacts from that strip, and a typo ("errpr" → "[MISSING]") in model_registry. No behavior change.
  • test: mirror tests/ layout to strands_robots/ source tree (0b95948) — moved test files so tests/simulation/mujoco/… mirrors strands_robots/simulation/mujoco/…. Pure file moves + __init__.py additions.
  • chore: apply ruff format/lint fixes — auto-formatter output only.
  • Existing files touched across strands_robots/policies/, strands_robots/tools/, tests/policies/, tests/registry/ — almost entirely emoji/divider strips; the actual behavior in those files is unchanged.

👉 If a file isn't in the Must-read table above, its diff is (almost certainly) cosmetic.


Usage

from strands_robots.simulation import Simulation
from strands import Agent

sim = Simulation()
agent = Agent(tools=[sim])
agent("Create a world with an so100 robot and a red cube, then step 100 times")

Or imperatively:

sim.create_world()
sim.add_robot(data_config="so100", name="alice")
sim.add_object(name="cube", shape="box", size=[0.03,0.03,0.03], rgba=[1,0,0,1])
sim.step(n_steps=100)
rgb = sim.render(camera="top", width=640, height=480)

Key design decisions

  1. Simulation extends AgentTool directly: Agent(tools=[Simulation()]) just works, no wrapper needed.
  2. Lazy MuJoCo import: _ensure_mujoco() only imports the heavy dep when a sim is actually created (keeps CLI startup fast).
  3. XML round-trip for scene mutation — standard approach (same as dm_control, robosuite); lets us add/remove robots and objects after compilation.
  4. Same Policy ABC for sim and real — a policy trained in sim runs on the real robot with zero code changes.
  5. Simulation is standalone — no dependency on Robot(). Addresses Arron's earlier ask: "the abstraction of sim should work standalone without robot too".
  6. Backend registry is extensible — third parties can register_backend("my_sim", MySim) at runtime (covered by test_factory.py).
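
For readers who want a feel for decision 6 before opening factory.py, here is a minimal sketch of a runtime backend registry. It is not the code in this PR: the `_BACKENDS` dict, the `overwrite` flag, the error types, and the `MySim` class are assumptions made for illustration; only the names `register_backend` and `create_simulation` come from the description above.

```python
from typing import Callable, Dict

# Hypothetical registry. The real factory.py may store and resolve backends differently.
_BACKENDS: Dict[str, Callable[..., object]] = {}


def register_backend(name: str, factory: Callable[..., object], *, overwrite: bool = False) -> None:
    """Register a backend factory under `name`, refusing silent replacement."""
    if name in _BACKENDS and not overwrite:
        raise ValueError(f"backend {name!r} is already registered")
    _BACKENDS[name] = factory


def create_simulation(backend: str, **kwargs) -> object:
    """Instantiate a registered backend. The factory only runs when requested."""
    try:
        factory = _BACKENDS[backend]
    except KeyError:
        raise ValueError(f"unknown backend {backend!r}; known: {sorted(_BACKENDS)}") from None
    return factory(**kwargs)


class MySim:
    """Stand-in third-party backend, used only for this illustration."""

    def __init__(self, timestep: float = 0.002) -> None:
        self.timestep = timestep


register_backend("my_sim", MySim)
sim = create_simulation("my_sim", timestep=0.001)
```

Because the registry stores a callable rather than an instance, a heavy dependency can stay unimported until `create_simulation` is called, which is the same idea as the lazy import in decision 2.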

New this round (final commits on the branch)

Since the last review pass, on top of all the review fixes:

  • Multi-robot recording (4904164) — when a scene holds >1 robot, joint names get per-robot prefixed (alice__shoulder_pan) so LeRobot dataset schemas are unambiguous per agent. Single-robot scenes keep the flat shoulder_pan names (backwards compat).
  • +994 lines of new tests (30e35c0) across 8 files — targeting previously-thin coverage: policy ABC contract, error branches, object-shape injection, recording paths, model registry, module __all__ lazy exports, policy-runner error paths, and the new multi-robot integration test.
  • Cosmetic cleanup (b2498ed) — see Noise to skim past above.
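
The multi-robot naming rule can be sketched in a few lines. This is an illustration under assumptions, not the recorder's actual code: the `state_names` function and its `robots` mapping are hypothetical, while the `__` separator and the flat single-robot behaviour follow the description above.

```python
from typing import Dict, List


def state_names(robots: Dict[str, List[str]], sep: str = "__") -> List[str]:
    """Return flat joint names for one robot, '<robot>__<joint>' names for several."""
    if len(robots) <= 1:
        # Single-robot scenes keep the flat names (backwards compatible).
        return [joint for joints in robots.values() for joint in joints]
    # Multi-robot scenes prefix every joint with its robot name.
    return [f"{robot}{sep}{joint}" for robot, joints in robots.items() for joint in joints]


single = state_names({"alice": ["shoulder_pan", "elbow_flex"]})
multi = state_names({"alice": ["shoulder_pan"], "bob": ["shoulder_pan"]})
```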

Testing locally

pip install -e ".[all,dev]"
hatch run test              # 1030 passed, 5 skipped, 5 pre-existing macOS-specific failures
hatch run test-integ        # requires GPU + MuJoCo (separate CI job)
hatch run lint              # clean

Depends on #83 (build) and #84 (sim foundation). After this lands, strands_robots.simulation.Simulation is fully usable as a standalone AgentTool.

Comment thread strands_robots/simulation/mujoco/simulation.py
Comment thread strands_robots/simulation/mujoco/simulation.py
Comment thread strands_robots/simulation/mujoco/scene_ops.py
Comment thread strands_robots/simulation/mujoco/recording.py Outdated
Comment thread strands_robots/simulation/mujoco/policy_runner.py Outdated
Comment thread strands_robots/simulation/mujoco/physics.py
Comment thread strands_robots/dataset_recorder.py
Comment thread strands_robots/dataset_recorder.py Outdated
Comment thread strands_robots/_async_utils.py
@cagataycali cagataycali force-pushed the feat/mujoco-backend branch from bc6080f to 78719d9 Compare April 1, 2026 20:03

@yinsong1986 yinsong1986 left a comment

All review comments addressed. LGTM.

@cagataycali cagataycali added this to the v0.4 milestone Apr 6, 2026
@cagataycali cagataycali force-pushed the feat/mujoco-backend branch 2 times, most recently from f461f30 to 4a3fd3c Compare April 6, 2026 07:03
@cagataycali
Member Author

Rebased feat/mujoco-backend onto the updated feat/simulation-foundation (which now has the [sim] extra with robot_descriptions).

pyproject.toml extras now:

sim = [
    "robot_descriptions>=1.11.0,<2.0.0",
]
sim-mujoco = [
    "mujoco>=3.0.0,<4.0.0",
]
all = [
    "strands-robots[groot-service]",
    "strands-robots[lerobot]",
    "strands-robots[sim]",
    "strands-robots[sim-mujoco]",
]

Both robot_descriptions (for asset downloads) and mujoco (for simulation backend) are now properly declared as separate extras and included in [all]. Ready for merge after PR #84 lands.

@cagataycali cagataycali force-pushed the feat/mujoco-backend branch from dda5248 to 696b423 Compare April 6, 2026 07:27
Comment thread pyproject.toml Outdated
Member

@awsarron awsarron left a comment

For all comments in this PR, we should examine common themes and include corrections for them in AGENTS.md so that future agent runs benefit from their lessons.

Comment thread pyproject.toml
Comment thread strands_robots/simulation/mujoco/__init__.py Outdated
Comment thread strands_robots/simulation/mujoco/simulation.py Outdated
Comment thread strands_robots/simulation/mujoco/backend.py
Comment thread strands_robots/simulation/mujoco/policy_runner.py Outdated
Comment thread strands_robots/simulation/mujoco/tool_spec.json
Comment thread strands_robots/simulation/mujoco/scene_ops.py
cagataycali added a commit to cagataycali/robots that referenced this pull request Apr 13, 2026
Move _xml, _robot_base_xml, and _tmpdir from SimWorld into a generic
_backend_state dict. Each backend stores its format-specific data there
instead of polluting the base class with implementation details.

Addresses @awsarron review: 'how can we avoid having implementation
details (Mujoco) in base classes like this?'

The MuJoCo backend (PR strands-labs#85) will store these in
world._backend_state['xml'], etc. during rebase.
@cagataycali
Member Author

Review Status Summary

All 17 review threads are now resolved.

Latest commit 6bb195a (Apr 12) fixed the Protocol annotation with TYPE_CHECKING stubs — the last open item.

CI: ✅ All checks passing
Mergeable: ✅ Clean merge with main
Threads: 17/17 resolved
Dependency: Waiting on PR #84 (simulation foundation) to merge first

@awsarron — this is ready for re-review. Once #84 merges, this can follow immediately.


🤖 Pipeline analysis by AI agent. Strands Agents. Feedback welcome!

@cagataycali
Member Author

📋 Review Status Summary

Hi @awsarron — consolidating the current state of this PR to help with re-review.

Thread Resolution: ✅ 17/17 resolved

All 17 review threads have been addressed and resolved:

| Reviewer | Topics Covered | Status |
|----------|----------------|--------|
| @awsarron | Module naming (mujoco vs sim-mujoco), private function exports removed, _ensure_mujoco centralized to init, headless platform support docs, mixin coupling reduced, action↔method drift test added, XML parsing consistency (ElementTree vs regex) | ✅ All resolved |
| @yinsong1986 | SimulationBackend ABC inheritance, self._lock thread safety, XML injection validation, overwrite default safety, total_reward cleanup, tempfile.mktemp → NamedTemporaryFile, dead code removal, frame-drop strictness, executor reuse, sim-mujoco dependency naming | ✅ All resolved |

Key changes since CHANGES_REQUESTED:

  • Simulation now inherits from SimulationBackend ABC
  • Thread lock properly acquired around model/data mutations
  • XML name validation: ^[a-zA-Z0-9_-]+$ pattern enforced
  • overwrite defaults to False with FileExistsError
  • tempfile.NamedTemporaryFile replaces mktemp
  • Single reused ThreadPoolExecutor instead of per-call creation
  • Action↔method mapping test added (catches enum drift)
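
As an illustration of the name-validation item, here is a minimal sketch. The `validate_name` helper and its error type are assumptions; only the pattern `^[a-zA-Z0-9_-]+$` comes from the list above. The sketch uses `fullmatch` because `$` combined with `re.match` would also accept a name with a trailing newline.

```python
import re

# Pattern quoted from the review summary. The helper around it is hypothetical.
_NAME_RE = re.compile(r"^[a-zA-Z0-9_-]+$")


def validate_name(name: str) -> str:
    """Return `name` unchanged if it is safe to embed in an XML attribute."""
    # fullmatch rejects "cube\n", which re.match with a trailing "$" would accept.
    if not _NAME_RE.fullmatch(name):
        raise ValueError(f"invalid name {name!r}: only letters, digits, '_' and '-' are allowed")
    return name


ok = validate_name("red_cube-1")
```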

CI: ✅ Passing

Latest commit status: SUCCESS

Dependency context

This PR depends on #84 (simulation foundation, also 50/50 resolved) and is a prerequisite for #86 (Robot factory).


🤖 Automated review triage by Strands Agents. Feedback welcome!

@cagataycali cagataycali requested a review from awsarron April 17, 2026 16:30
cagataycali added a commit to cagataycali/robots that referenced this pull request Apr 17, 2026
Move _xml, _robot_base_xml, and _tmpdir from SimWorld into a generic
_backend_state dict. Each backend stores its format-specific data there
instead of polluting the base class with implementation details.

Addresses @awsarron review: 'how can we avoid having implementation
details (Mujoco) in base classes like this?'

The MuJoCo backend (PR strands-labs#85) will store these in
world._backend_state['xml'], etc. during rebase.
@cagataycali cagataycali modified the milestones: v0.4.0, v0.3.9 Apr 21, 2026
@cagataycali
Member Author

Follow-up from second-opinion review

Ran another DevDuck pass to pressure-test the PR. Full review lives at /tmp/pr85-review/BRUTAL_REVIEW.md locally. TL;DR: grade B+, ship it, but track 7 follow-ups.

I landed the two safe/fast wins directly on this branch (commits e26275b and d275277):

  1. tool_spec JSON caching — the @property was re-parsing a 357-line JSON on every LLM invocation. Now loaded once at module import, with identity-check regression tests.
  2. Host-path hygiene guard (tests/test_no_host_paths.py) — regex sweep that fails CI if anyone commits a /Users/<name>/ or /home/<name>/ path again. Prevents a repeat of the test_agenttool_contract.py slip this PR already hit.
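
A rough sketch of what such a host-path sweep could look like. The regex and the `find_host_paths` helper are assumptions, not the contents of tests/test_no_host_paths.py; the real guard may use a different pattern or an allow-list.

```python
import re
from typing import List

# Assumed pattern: an absolute macOS or Linux home directory with a concrete user name.
# Placeholders such as /Users/<name>/ do not match because '<' is outside the character class.
HOST_PATH_RE = re.compile(r"/(?:Users|home)/[A-Za-z0-9._-]+/")


def find_host_paths(text: str) -> List[str]:
    """Return every host-specific home-directory prefix found in `text`."""
    return HOST_PATH_RE.findall(text)


leaky = find_host_paths('MODEL = "/Users/jane/models/so100.xml"')
clean = find_host_paths('MODEL = Path(__file__).parent / "so100.xml"')
```

A CI test would read each tracked source file, call the helper, and fail if any file yields a non-empty list.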

What I did not do (and why): the review suggested deduping the ~40 if self._world is None or ... checks through the existing _require_world() helper. I tried it — it breaks mypy narrowing across the entire mujoco module (150+ new union-attr errors). Commit f5c8518 on this branch already evaluated and reverted that exact change with a clear rationale. Not reopening.

Structural follow-ups on the project board (all P2, Backlog):

CI should stay green. hatch run lint + hatch run format + hatch run test tests/simulation/ all pass locally.

Fixes GH strands-labs#115. load_scene previously did not populate _backend_state
bookkeeping, so subsequent add_object / add_camera / remove_object
calls either:

  * recompiled the world via MJCFBuilder.build_objects_only (silently
    discarding every body from the loaded scene), or
  * hit the XML round-trip path but fell through to mj_saveLastXML
    global state and emitted the wrong (robot, not scene) XML.

Changes in load_scene:
  * Cache the on-disk scene XML in _backend_state['xml'].
  * Set _backend_state['scene_loaded'] = True as a marker.
  * Record _backend_state['scene_base_dir'] for mesh path resolution
    during injection round-trips.

Changes in add_object / add_camera / remove_object:
  * Gate the XML-round-trip branch on
    'robots OR scene_loaded' instead of 'robots only'.
    Previously the no-robots branch called _recompile_world() which
    rebuilds via MJCFBuilder.build_objects_only and would wipe a
    loaded scene's bodies and meshes.

Changes in scene_ops._get_robot_base_dir:
  * Fall back to _backend_state['scene_base_dir'] when no
    robot_base_xml is registered, so mesh refs in a round-tripped
    scene XML still resolve under tmpdir.

New test file tests/simulation/mujoco/test_load_scene_interaction.py
(9 tests) covers:
  * _backend_state population contract (3 tests)
  * add_object / add_camera / remove_object preserve scene bodies
  * The full load_scene -> add_robot -> add_object chain
  * create_world does NOT set scene_loaded (regression guard)

Verified that all 8 behavioural tests fail on pre-fix code and pass
after the fix. The single trivially-passing test
(create_world_does_not_set_scene_loaded) is a guard against
accidentally setting the flag in the non-load_scene path.

Full suite: 524 passed, 1 skipped (was 515; +9 new). Lint clean.
…#114)

Fixes GH strands-labs#114. Previously, despite _policy_threads being keyed by
robot_name (implying concurrency), the _require_no_running_policy
helper blocked *any* live Future on every scene mutation AND on every
start_policy call. Two VLA arms could not actually run policies in the
same scene.

The mixin's docstring even said 'per robot' while semantics were serial.

## Semantics changes

start_policy(X):
  * Was: global check — rejected if ANY Future was not done.
  * Now: per-robot check — only rejected if X's own Future is live.
    Policies on different robots can coexist.

remove_robot(X):
  * Was: errored when X had a live policy; required two-step
    stop_policy(X) + remove_robot(X).
  * Now: gracefully stops X's own policy (as before), then runs the
    XML-round-trip ejection. Still errors if a DIFFERENT robot has a
    live policy, because that robot's PolicyRunner holds cached
    actuator/joint IDs that the recompile invalidates.

Scene mutations (add_robot, add_object, add_camera, remove_object,
remove_camera, load_scene, reset, set_gravity, set_timestep, randomize,
apply_force, set_body_properties, set_geom_properties, set_joint_*,
load_state, move_object): still block on ANY live policy. Unchanged.
The error message now NAMES the active-policy robots so the LLM can
stop_policy on each without guessing:

    Cannot 'set_gravity' while a policy is running on 'armA', 'armB'.
    Stop it first: action='stop_policy'.

## New helpers

* _prune_done_futures() — drops completed Futures from
  _policy_threads (GH strands-labs#120 companion fix). Previously the dict grew
  unboundedly and list_policies_running would leak historical names
  as 'running'.
* _active_policy_robots() — returns names with LIVE policies. Prunes
  stale entries as a side-effect so the returned list is authoritative.
* _require_no_running_policy(action_name, robot_name=None) — new
  keyword arg scopes the check to one robot. robot_name=None is the
  existing global-scope behaviour.
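
A minimal sketch of how the three helpers fit together, assuming plain `concurrent.futures.Future` objects. The `PolicyGuard` class and the choice to return an error string (or None) are illustrative assumptions; the real helpers live on Simulation and may report errors differently.

```python
from concurrent.futures import Future
from typing import Dict, List, Optional


class PolicyGuard:
    """Hypothetical stand-in for the policy bookkeeping on Simulation."""

    def __init__(self) -> None:
        self._policy_threads: Dict[str, Future] = {}

    def _prune_done_futures(self) -> None:
        # Drop completed Futures so the dict cannot grow without bound.
        self._policy_threads = {
            name: fut for name, fut in self._policy_threads.items() if not fut.done()
        }

    def _active_policy_robots(self) -> List[str]:
        # Pruning first makes the returned list authoritative.
        self._prune_done_futures()
        return sorted(self._policy_threads)

    def _require_no_running_policy(
        self, action_name: str, robot_name: Optional[str] = None
    ) -> Optional[str]:
        """Return an error message if the action is blocked, else None."""
        active = self._active_policy_robots()
        if robot_name is not None:
            # Per-robot scope: only that robot's own live policy blocks the action.
            active = [name for name in active if name == robot_name]
        if not active:
            return None
        names = ", ".join(repr(name) for name in active)
        return (
            f"Cannot {action_name!r} while a policy is running on {names}. "
            "Stop it first: action='stop_policy'."
        )


guard = PolicyGuard()
fut_a, fut_b = Future(), Future()
guard._policy_threads = {"armA": fut_a, "armB": fut_b}
blocked = guard._require_no_running_policy("set_gravity")
allowed = guard._require_no_running_policy("start_policy", robot_name="armC")
```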

## New action

list_policies_running — returns the names of robots with live
policies. Idempotent, always succeeds, prunes stale entries.

Added to tool_spec.json enum and to the tool_spec description so the
LLM discovers it.

## Why this is safe

MuJoCo's mj_step and ctrl[] writes are still serialized via self._lock,
which is the single point that makes concurrent multi-robot policies
safe:

  * Two policies on different robots run in parallel at the inference
    level (observation build, action compute — no shared state).
  * When either calls send_action, it serializes briefly on self._lock
    to write its own ctrl[] slots and advance physics.
  * mj_step advances the WHOLE scene — so two robots sharing a world
    share one physics clock. That's correct: one tick of physical time
    advances all bodies.
  * Each robot writes to a DISJOINT slice of data.ctrl[], indexed by
    actuator IDs specific to that robot's namespaced actuators (set up
    by inject_robot_into_scene via _prefix_robot_names). No ctrl[]
    aliasing.

Documented inline on __init__ and in start_policy's docstring.

## Tests

tests/simulation/mujoco/test_concurrency.py — adds
TestConcurrentPerRobotPolicies class with 6 new tests:

  * test_start_policy_allowed_on_second_robot_while_first_runs
  * test_start_policy_still_rejected_on_SAME_robot
  * test_list_policies_running_reports_active
  * test_completed_futures_are_pruned (GH strands-labs#120 companion)
  * test_scene_mutation_lists_which_robots_are_running
  * test_two_policies_no_segfault_under_stress — actually runs two
    policies to completion and asserts both produced policy_steps > 0

Updated the existing test_remove_robot_blocked_during_policy (which
encoded the old 'error when same-robot policy active' semantics) into
two tests that reflect the new semantics:

  * test_remove_robot_stops_own_policy_and_succeeds
  * test_remove_robot_blocked_by_OTHER_robot_policy

Verified: the new tests fail on pre-fix simulation.py (stashed to
confirm), pass on post-fix code.

## Numbers

* 525 -> 531 passed (+6 new) in tests/simulation/
* hatch run lint: clean (no new errors)
* hatch run format: clean
* CHANGELOG.md updated with both the concurrency change and the new
  list_policies_running action.

Closes strands-labs#114. Companion fix for strands-labs#120 (stale Future pruning).
@cagataycali
Member Author

Landing #114 + #115 on this PR

Two more follow-ups from the second-opinion review now land directly on this branch.


Commit 3a6ad50: fix(sim/mujoco): load_scene + add_* interaction (#115)

The bug. load_scene skipped the _backend_state bookkeeping that create_world populates. With no robots registered, a subsequent add_object / add_camera / remove_object took the else branch that calls _recompile_world() → which rebuilds from MJCFBuilder.build_objects_only and silently wiped every body from the loaded scene. The add_robot path hit inject_robot_into_scene but fell through to mj_saveLastXML global state and emitted the wrong (robot, not scene) XML.

The fix.

  • load_scene caches the on-disk XML in _backend_state['xml'], sets _backend_state['scene_loaded'] = True, and records scene_base_dir for mesh-path resolution.
  • add_object / add_camera / remove_object gate the XML round-trip branch on robots OR scene_loaded instead of robots only.
  • scene_ops._get_robot_base_dir falls back to scene_base_dir so mesh refs in a round-tripped scene XML still resolve.
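
The gating change boils down to one predicate. A sketch, assuming `_backend_state` is a plain dict and using a hypothetical helper name:

```python
from typing import Any, Dict


def needs_xml_round_trip(robots: Dict[str, Any], backend_state: Dict[str, Any]) -> bool:
    """True when a mutation must edit the cached XML instead of rebuilding from scratch."""
    # Before the fix the condition was effectively `bool(robots)` alone, so a loaded
    # scene with no robots fell into the rebuild path and lost its bodies.
    return bool(robots) or bool(backend_state.get("scene_loaded", False))


loaded_scene_no_robots = needs_xml_round_trip({}, {"scene_loaded": True, "xml": "<mujoco/>"})
fresh_world = needs_xml_round_trip({}, {})
```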

New tests. tests/simulation/mujoco/test_load_scene_interaction.py (9 tests) covering the full chain: load_scene → add_robot → add_object. Confirmed all 8 behavioural tests fail on pre-fix code.


Commit 306220e: feat(sim/mujoco): support concurrent per-robot policies (#114)

The gap. _policy_threads: dict[str, Future] was keyed by robot_name, the docstrings talked about "per-robot" concurrency — but _require_no_running_policy hit every callsite with a global any(not f.done()) check, including start_policy itself. Two VLA arms couldn't actually run policies in the same scene.

The fix (scoped, not a rewrite).

| Action | Before | After |
|--------|--------|-------|
| start_policy(X) | Global check: errored if ANY policy was live | Per-robot check: only X's own gate |
| remove_robot(X) | Errored if X had a live policy; required two-step stop+remove | Gracefully stops X's own policy, then does the XML round-trip |
| Scene mutations | Global check, fires on any live policy | Global check, but error message now NAMES which robot(s) are active |

New helpers: _active_policy_robots(), _prune_done_futures() (companion fix for #120). _require_no_running_policy(action_name, robot_name=None) now takes an optional scope.

New action: list_policies_running — returns the names of robots with live policies. Added to tool_spec.json and the tool description so the LLM discovers it.

Why this is safe. self._lock still serializes mj_step and ctrl[] writes (MuJoCo isn't thread-safe for concurrent mutation). Two policies on different robots:

  • run in parallel at the inference level (observation build, action compute — no shared state)
  • serialize briefly on self._lock when calling send_action
  • write to disjoint slices of data.ctrl[] (each robot's actuators are namespaced via _prefix_robot_names during injection)
  • share one physics clock (correct: one mj_step = one tick for the whole scene)
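
The safety argument can be sketched without MuJoCo. In this toy model a Python list stands in for data.ctrl[] and a counter stands in for mj_step; `TinyWorld` and its actuator-ID mapping are assumptions made for illustration only.

```python
import threading
from typing import Dict, List, Sequence


class TinyWorld:
    """Toy stand-in: `ctrl` plays the role of data.ctrl[], `ticks` of the physics clock."""

    def __init__(self, actuator_ids: Dict[str, List[int]]) -> None:
        self.ctrl = [0.0] * sum(len(ids) for ids in actuator_ids.values())
        self.ticks = 0
        self._actuator_ids = actuator_ids
        self._lock = threading.Lock()

    def send_action(self, robot: str, action: Sequence[float]) -> None:
        # Inference happens outside the lock. Only the write and the step are serialized.
        with self._lock:
            for idx, value in zip(self._actuator_ids[robot], action):
                self.ctrl[idx] = value  # each robot touches only its own slots
            self.ticks += 1  # stands in for one mj_step over the whole scene


world = TinyWorld({"alice": [0, 1], "bob": [2, 3]})


def run_policy(robot: str, value: float, steps: int) -> None:
    for _ in range(steps):
        world.send_action(robot, [value, value])


threads = [
    threading.Thread(target=run_policy, args=("alice", 1.0, 200)),
    threading.Thread(target=run_policy, args=("bob", 2.0, 200)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because the slices are disjoint and every step happens under the lock, neither policy can overwrite the other's controls, and the tick count equals the total number of actions sent.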

New tests. TestConcurrentPerRobotPolicies in test_concurrency.py (6 tests):

  • test_start_policy_allowed_on_second_robot_while_first_runs
  • test_start_policy_still_rejected_on_SAME_robot
  • test_list_policies_running_reports_active
  • test_completed_futures_are_pruned (also closes #120, "sim/mujoco: _policy_threads dict accumulates completed Future refs forever")
  • test_scene_mutation_lists_which_robots_are_running
  • test_two_policies_no_segfault_under_stress — actually runs both policies to completion, asserts policy_steps > 0 on both robots

Updated test_remove_robot_blocked_during_policy → split into test_remove_robot_stops_own_policy_and_succeeds + test_remove_robot_blocked_by_OTHER_robot_policy. Verified the new tests fail on pre-fix code via git stash.

CHANGELOG updated with both changes.


Numbers

  • tests/simulation/: 531 passed, 1 skipped (was 515; +16 across both commits)
  • hatch run lint: clean
  • hatch run format: clean
  • Full suite: 1236 passed (5 failures remain pre-existing in test_path_validation.py, unrelated)

Status of the 7 follow-up issues

…s-labs#117)

Fixes GH strands-labs#117. PolicyRunner.run previously caught ALL on_frame
exceptions (other than CooperativeStop) at WARN level and kept
iterating. Failure mode: a recording hook with a typo'd observation
key would raise on every step, produce one log line per step for
500 steps, and complete 'successfully' with zero frames written.
The resulting dataset is silently empty.

Fix: count *consecutive* on_frame failures. After N in a row (default
5, overridable via new kwarg max_onframe_failures), raise RuntimeError
so run() returns status=error with a clear message. A single transient
failure still logs at WARN and keeps going — the counter resets on
the next successful call.
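
A minimal sketch of the consecutive-failure counter. The `run_frames` function and its dict result are assumptions; it returns the error status directly instead of raising RuntimeError, and it leaves out the CooperativeStop path.

```python
from typing import Callable, Dict


def run_frames(
    n_steps: int, on_frame: Callable[[int], None], max_onframe_failures: int = 5
) -> Dict[str, object]:
    """Call `on_frame` once per step and abort after N failures in a row."""
    consecutive = 0
    completed = 0
    for step in range(n_steps):
        try:
            on_frame(step)
        except Exception as exc:
            consecutive += 1
            if consecutive >= max_onframe_failures:
                return {
                    "status": "error",
                    "steps": completed,
                    "message": f"on_frame failed {consecutive} times in a row: {exc!r}",
                }
        else:
            consecutive = 0  # a single success resets the counter
        completed += 1
    return {"status": "success", "steps": completed}


def always_fails(step: int) -> None:
    raise KeyError("observation.stat")  # e.g. a typo'd observation key


def fails_every_other_step(step: int) -> None:
    if step % 2 == 0:
        raise ValueError("transient")


aborted = run_frames(500, always_fails)
tolerated = run_frames(10, fails_every_other_step)
```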

Plumbed the new kwarg through:
  * PolicyRunner.run (core)
  * SimEngine.run_policy (base)
  * Simulation.run_policy (MuJoCo override)

Tests: 4 new in TestOnFrameFailureCounter class:
  * test_single_onframe_failure_is_tolerated
  * test_consecutive_onframe_failures_abort_episode
  * test_consecutive_counter_resets_on_success
  * test_default_threshold_is_5

All 535 tests pass (was 531; +4 new). Lint clean.
…strands-labs#116)

Fixes GH strands-labs#116. Previously cleanup() called executor.shutdown(wait=False)
right after setting self._world = None, which opened a race window where
a policy worker still inside mj_step(world._model, world._data) would
segfault on freed arrays. The 'policy_running = False' flag was set but
never awaited.

New cleanup order:
  1. Signal every live policy to stop (policy_running = False).
  2. Await each outstanding Future with a bounded timeout. The on_frame
     hook sees the flag at the top of its next call and raises
     CooperativeStop, which short-circuits run_policy.
  3. Workers that don't stop within the timeout get logged as a warning
     and abandoned — cleanup proceeds rather than hanging the host
     process on exit.
  4. Only AFTER workers have unwound do we null self._world and tear
     down renderers / viewer / executor.

New kwarg: cleanup(policy_stop_timeout=...) for tests and edge cases.
Defaults to 5.0s via a module-level _DEFAULT_POLICY_STOP_TIMEOUT
constant. None (default) uses the constant.
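
The ordering can be illustrated with a toy class. `TinySim`, its stop event, and the boolean return value are assumptions for the sketch; the real cleanup signals via policy_running and CooperativeStop as described above.

```python
import threading
import time
from concurrent.futures import Future, ThreadPoolExecutor, wait
from typing import List, Optional

_DEFAULT_POLICY_STOP_TIMEOUT = 5.0


class TinySim:
    """Toy model of the cleanup ordering. Not the real Simulation class."""

    def __init__(self) -> None:
        self._world: Optional[dict] = {"ticks": 0}
        self._stop = threading.Event()
        self._executor = ThreadPoolExecutor(max_workers=2)
        self._futures: List[Future] = []

    def start_policy(self) -> None:
        self._futures.append(self._executor.submit(self._worker))

    def _worker(self) -> None:
        world = self._world
        while not self._stop.is_set():
            world["ticks"] += 1  # stands in for mj_step on live arrays
            time.sleep(0.001)

    def cleanup(self, policy_stop_timeout: Optional[float] = None) -> bool:
        timeout = (
            _DEFAULT_POLICY_STOP_TIMEOUT if policy_stop_timeout is None else policy_stop_timeout
        )
        self._stop.set()  # 1. signal every live policy to stop
        _, not_done = wait(self._futures, timeout=timeout)  # 2. bounded await
        # 3. workers still in not_done are abandoned (real code would log a warning)
        self._world = None  # 4. release the world only after workers have unwound
        self._executor.shutdown(wait=False)
        self._futures.clear()
        return not not_done  # True when every worker stopped in time


sim = TinySim()
sim.start_policy()
time.sleep(0.01)
drained = sim.cleanup()
```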

Tests: 4 new in TestCleanupGracefulShutdown:
  * test_cleanup_awaits_running_policy — verifies Future.done() by the
    time cleanup returns
  * test_cleanup_tolerates_wedged_policy — proves cleanup returns in
    bounded time even with an aggressively-short 1ms timeout
  * test_cleanup_is_idempotent_with_no_policies — no-op when there are
    no live Futures
  * test_cleanup_drains_multiple_concurrent_policies — pairs with
    GH strands-labs#114 concurrent-policy support; both robots' futures awaited

All 539 tests pass (was 535; +4 new). Lint clean.
Closes GH strands-labs#119. The mutation guard (_require_no_running_policy) is the
load-bearing safety mechanism that stops the LLM from scheduling a
scene mutation while a policy worker is mid-step. A race between the
guard and the worker's mj_step is a SIGSEGV on stale pointers. We had
a partial stress test in strands-labs#114's commit (two policies run to
completion), but no test that proved:

  * 1000 concurrent main-thread mutation attempts don't starve the
    worker
  * rapid start/stop/start/stop cycles leave _policy_threads clean
  * the first mutation after a policy completes succeeds (no
    lingering guard state)
  * two concurrent policies + 500 main-thread mutations don't deadlock
    on self._lock

New TestMutationGuardStress class covers all four:
  * test_1000_set_gravity_calls_during_policy_never_segfault
  * test_rapid_start_stop_start_stop_policy
  * test_mutation_accepted_immediately_after_policy_completes
  * test_concurrent_policies_stress_no_deadlock (pairs with GH strands-labs#114)

Each test asserts well-formed dict responses (no crashes), specific
status invariants, and the uniform 'policy is running' error shape
when blocked.

All 543 tests pass (was 539; +4 new). Lint clean.
Per user request during autonomous review cycle. The em-dash (—) and
horizontal ellipsis (…) unicode characters sneak in when docstrings
get authored in text editors with smart-quote autocorrect. They look
fine in rendered markdown but are noisy in code and diffs, don't
copy-paste cleanly into terminals, and break grep with non-unicode
patterns.

Bulk replacements:
  * 424 em-dashes ('—' U+2014) -> ' - ' (with normalized spacing) or '-'
    (at line start, mostly bullet points)
  * 8 horizontal-ellipsis ('…' U+2026) -> '...' (three ASCII dots)

Also fixed one arithmetic bug surfaced by the ellipsis replacement:
  * strands_robots/registry/robots.py: description-truncation
    previously subtracted 1 char (for the 1-char ellipsis) and
    appended 3 chars (for '...'), overflowing the table column by
    2 chars. Now subtracts 3.
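
The arithmetic is easy to get wrong, so here is a small sketch of a width-safe truncation. The `truncate` helper is hypothetical and is not the code in registry/robots.py.

```python
def truncate(text: str, width: int, ellipsis: str = "...") -> str:
    """Shorten `text` to at most `width` characters, ending in `ellipsis` if cut."""
    if len(text) <= width:
        return text
    if width <= len(ellipsis):
        return ellipsis[:width]
    # Subtract the full length of the ellipsis, not 1, so the result fits the column.
    return text[: width - len(ellipsis)] + ellipsis


cell = truncate("abcdefghij", 8)
```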

Files touched:
  * 39 in strands_robots/
  * 30 in tests/
  * 5 in tests_integ/
  * CHANGELOG.md, README.md, AGENTS.md
  * 82 files total

No semantic changes beyond the truncation fix above. 1248 tests pass
(was 1236); the 5 pre-existing test_path_validation failures are unrelated.
…trands-labs#118)

Partial address of GH strands-labs#118. The review correctly flagged that the
4-way mixin split (PhysicsMixin + RenderingMixin + RecordingMixin +
RandomizationMixin) pretends to describe a decoupling when it really
just describes *where lines live*. Every mixin reaches back into
Simulation for self._world / self._lock / self._mj / _policy_threads /
_renderer_tls, plus the cross-cutting _require_no_running_policy /
_require_world / _prune_done_futures helpers.

Rather than pretend otherwise, this commit makes the coupling
documentary and explicit:

1. simulation.py module-level docstring replaced with a full
   'Architecture notes (honest version)' block that enumerates every
   piece of shared state and every cross-cutting helper the mixins
   rely on. Cross-refs GH strands-labs#118 and commit f5c8518 (which established
   that the alternative -- _SimulationState extraction -- breaks mypy
   narrowing across the helper boundary).

2. Every mixin's class docstring rewritten to name the specific state
   it touches and the specific helpers it calls. Short, precise,
   greppable.

3. TYPE_CHECKING stubs in each mixin updated to reflect the NEW
   per-robot _require_no_running_policy signature (from strands-labs#114) and to
   add _require_world which previously was missing despite being
   used. Now when we edit the real helpers in simulation.py, mypy
   can check the mixin call sites against the intended shape.

4. Class body order normalized: docstring first, THEN TYPE_CHECKING
   block. Previously PhysicsMixin and RandomizationMixin had the stub
   block *before* the class docstring, which hid the real
   documentation.
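
A sketch of the layout items 3 and 4 describe: class docstring first, then a TYPE_CHECKING block whose stubs mypy sees but Python never executes. The exact stub signature below is an assumption; the real per-robot signature lives in simulation.py.

```python
import threading
from typing import TYPE_CHECKING, Optional


class PhysicsMixin:
    """Physics actions. Touches self._lock; calls _require_no_running_policy."""

    if TYPE_CHECKING:
        # Stubs only: visible to the type checker, absent at runtime.
        _lock: threading.RLock

        def _require_no_running_policy(
            self, robot_name: Optional[str] = None
        ) -> Optional[dict]: ...

    def set_gravity(self, g) -> dict:
        with self._lock:
            blocked = self._require_no_running_policy()
            if blocked is not None:
                return blocked
            self.gravity = tuple(g)
            return {"status": "success"}


class Simulation(PhysicsMixin):
    """Owns the shared state and the real helper the mixin relies on."""

    def __init__(self) -> None:
        self._lock = threading.RLock()
        self.gravity = (0.0, 0.0, -9.81)

    def _require_no_running_policy(self, robot_name=None):
        return None  # toy version: no policy is ever running
```

Because the stubs vanish at runtime, the mixin cannot be used on its own; that is exactly the coupling the docstrings now state.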

No runtime behavior change. Lint clean. 543 tests pass.

This leaves the bigger structural question (actually extract
_SimulationState, or merge mixins back into one file) open. That's
tracked on strands-labs#118 -- it's an L/XL refactor and needs its own PR. For
THIS PR, the goal was to stop the split from being *dishonest*.

Cosmetic/quality sweep surfaced a dead return value: _ensure_meshes
returned an error dict on auto-download failure, but the caller at
add_robot (line 494) discarded the return value. Result: the agent
got a cryptic 'mesh not found' from MuJoCo later instead of the clear
'Auto-download failed for X: Y. Install robot_descriptions:...' that
_ensure_meshes constructs.

Changes:
  * _ensure_meshes typed as -> dict[str, Any] | None explicitly
  * Explicit return None on all success paths (previously the function
    fell off the end in places, which implicitly returned None but
    was not self-documenting)
  * Caller in add_robot now checks the return value; propagates any
    error dict and pops the partially-registered robot out of
    self._world.robots before bubbling up

No test change -- the existing happy-path tests still pass, and the
error path requires network-blocked CI to test cleanly (left as
integration territory). Lint and all 543 tests pass.
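
The shape of the fix, as a self-contained sketch. The signatures are simplified and hypothetical (the real functions are methods on Simulation); the download callable stands in for the auto-download step.

```python
from typing import Callable, Optional


def _ensure_meshes(name: str, download: Callable[[str], None]) -> Optional[dict]:
    """Return an error dict on failure, None on success."""
    try:
        download(name)
    except Exception as exc:
        return {
            "status": "error",
            "message": f"Auto-download failed for {name}: {exc}",
        }
    return None  # explicit on the success path


def add_robot(robots: dict, name: str, download: Callable[[str], None]) -> dict:
    robots[name] = {"name": name}  # provisional registration
    err = _ensure_meshes(name, download)
    if err is not None:
        robots.pop(name, None)  # undo the half-registered robot
        return err  # propagate instead of discarding
    return {"status": "success"}
```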
@cagataycali
Member Author

All 7 review follow-ups now landed on this PR

A second autonomous cycle landed fixes for the remaining structural issues from the brutal-review follow-up list.


daaf421 - fix(sim): abort episode after N consecutive on_frame failures (#117)

  • PolicyRunner.run counts consecutive on_frame failures. After N in a row (default 5, tunable via new max_onframe_failures kwarg), raises RuntimeError so run() returns status=error with a clear message.
  • Counter resets on next successful call - single transient failures still log + continue.
  • Kwarg threaded through PolicyRunner.run -> SimEngine.run_policy -> Simulation.run_policy.
  • 4 new tests in TestOnFrameFailureCounter.
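
A minimal sketch of the consecutive-failure counter, assuming a simplified loop; only the max_onframe_failures name and its default of 5 come from the commit, the rest is illustrative.

```python
def run(frames, on_frame, max_onframe_failures: int = 5) -> dict:
    """Deliver frames to on_frame; abort after N failures in a row."""
    consecutive = 0
    delivered = 0
    try:
        for frame in frames:
            try:
                on_frame(frame)
            except Exception as exc:
                consecutive += 1
                if consecutive >= max_onframe_failures:
                    raise RuntimeError(
                        f"on_frame failed {consecutive} times in a row: {exc}"
                    ) from exc
                continue  # transient failure: real code would log and go on
            consecutive = 0  # any success resets the counter
            delivered += 1
    except RuntimeError as exc:
        return {"status": "error", "message": str(exc)}
    return {"status": "success", "delivered": delivered}
```

A callback that fails on every other frame never trips the limit, while one that always fails aborts on the fifth frame.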

296406f - fix(sim/mujoco): cleanup awaits running policies before nulling world (#116)

  • New cleanup order: signal policy stop -> await Future with bounded timeout -> null self._world -> teardown renderers/viewer/executor.
  • Wedged workers log a warning and get abandoned rather than hanging the host process on exit.
  • New kwarg cleanup(policy_stop_timeout=...) with _DEFAULT_POLICY_STOP_TIMEOUT = 5.0.
  • 4 new tests in TestCleanupGracefulShutdown including 1ms-timeout stress test.
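
The cleanup ordering above can be sketched with concurrent.futures. ToySim is hypothetical and models a single policy; only the policy_stop_timeout kwarg and the _DEFAULT_POLICY_STOP_TIMEOUT constant are taken from the commit.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

_DEFAULT_POLICY_STOP_TIMEOUT = 5.0


class ToySim:
    def __init__(self) -> None:
        self._world = object()
        self._stop = threading.Event()
        self._executor = ThreadPoolExecutor(max_workers=1)
        self._future = None

    def start_policy(self, body) -> None:
        self._future = self._executor.submit(body, self._stop)

    def cleanup(self, policy_stop_timeout: float = _DEFAULT_POLICY_STOP_TIMEOUT) -> bool:
        """Return True if a wedged worker had to be abandoned."""
        abandoned = False
        self._stop.set()  # 1. signal the policy to stop
        if self._future is not None:
            try:
                self._future.result(timeout=policy_stop_timeout)  # 2. bounded wait
            except FutureTimeout:
                abandoned = True  # real code would log a warning here
            except Exception:
                pass  # the worker failed on its own; nothing left to wait for
        self._world = None  # 3. only now drop the world
        self._executor.shutdown(wait=False)  # 4. tear down without blocking
        return abandoned
```

Nulling the world only after the bounded wait is what keeps a cooperative worker from touching freed state.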

5c11959 - test(sim/mujoco): mutation guard stress tests (#119)

  • 1000 concurrent main-thread mutation attempts during a live policy: no crashes, no starvation.
  • 10 rapid start/stop cycles: _policy_threads stays clean.
  • Mutation immediately after policy completes succeeds: no lingering guard state.
  • Two concurrent policies + 500 main-thread mutations: no deadlock on self._lock.
  • 4 new tests in TestMutationGuardStress.

9f1b4ac - style: replace em-dashes (U+2014) with ASCII hyphens

Cosmetic cleanup of the whole codebase, per user request:

  • 424 em-dashes (U+2014) -> ' - ' (with normalized spacing) or '-' (at line start, mostly bullet points)
  • 8 horizontal-ellipsis (U+2026) -> ...
  • Touched 82 files (39 source + 30 tests + 5 integration + 3 docs)
  • Fixed one arithmetic bug the ellipsis replacement surfaced: strands_robots/registry/robots.py description truncation was subtracting 1 char (for old 1-char ellipsis) but appending 3 chars for ..., overflowing the column by 2.

6c97df5 - docs(sim/mujoco): honest documentation of the mixin coupling graph (#118)

Rather than attempt a risky full refactor (the _SimulationState extraction breaks mypy narrowing through the helper boundary, same limitation that led commit f5c8518 to back out helper-based _require_world dedup), this commit lands the documentary fix the review really cared about:

  1. simulation.py module-level docstring: full 'Architecture notes (honest version)' block enumerating all shared state and cross-cutting helpers.
  2. Every mixin's class docstring rewritten to name the specific state it touches and the specific helpers it calls.
  3. TYPE_CHECKING stubs in each mixin updated to the NEW per-robot _require_no_running_policy signature and _require_world.
  4. Class body order normalized: docstring FIRST, then TYPE_CHECKING block.

The architectural question (keep mixins vs. merge vs. extract _SimulationState) stays tracked on #118 for a dedicated future PR.

e20c540 - fix(sim/mujoco): propagate _ensure_meshes auto-download errors

Sanity-check pass found a dead return value: _ensure_meshes returned an error dict on auto-download failure, but add_robot discarded it. Agent got a cryptic MuJoCo 'mesh not found' instead of the clear 'Install robot_descriptions: ...' we were trying to surface. Caller now propagates the error dict and cleans up the half-registered robot.

3894a43 - CHANGELOG updated for all the above.


Numbers after this cycle

All 7 follow-up issues now resolved

| Issue | Status | Commit |
|---|---|---|
| #114 concurrent per-robot policies | Done | 306220e |
| #115 load_scene interaction | Done | 3a6ad50 |
| #116 cleanup race | Done | 296406f |
| #117 on_frame swallow | Done | daaf421 |
| #118 god-class docs | Done (docs) | 6c97df5 |
| #119 concurrency stress | Done | 5c11959 |
| #120 stale Future pruning | Done | 306220e (companion) |

@cagataycali changed the title from "feat: MuJoCo simulation backend — AgentTool with 35 actions" to "feat: MuJoCo simulation backend - AgentTool with 35 actions" on May 5, 2026
Start the string-concat -> MjSpec refactor documented in IDEA.md.
The current 'builder' (mjcf_builder.MJCFBuilder) hand-writes MJCF as
f-string concatenation and needs ~600 lines of scaffolding (sanitize,
xyaxes math, ElementTree round-trips) to do what mujoco.MjSpec (shipped
in mujoco 3.2+) does natively.

This PR lands Stages 0-2:

  * Stage 0: bump mujoco floor to 3.2 (env already at 3.8), track IDEA.md
    in repo so the plan is visible to agents.
  * Stage 1: add spec_builder.py alongside mjcf_builder.py, behind a
    STRANDS_SIM_USE_MJSPEC env flag. Both code paths tested in CI.
  * Stage 2: replace camera xyaxes math with mujoco.mju_mat2Quat via
    the new _target_quat helper. MjSpec cameras use quat= instead of
    xyaxes=; compiled cam_mat0 is numerically identical within 4e-7.

## Changes

### New: strands_robots/simulation/mujoco/spec_builder.py (+260 LOC)

Exports:
  * SpecBuilder.build(world) -> mujoco.MjSpec
  * _geom_type(shape) -> mjtGeom enum (drop-in for the shape enum
    lookup; also adds 'ellipsoid' which the legacy builder rejects)
  * _normalize_size(shape, size) -> per-type size list
  * _target_quat(pos, target) -> look-at quaternion via mju_mat2Quat

Does NOT touch scene_ops (Stages 5-6) or robots (Stages 3-4), and does
not remove the legacy builder (Stage 7). Purely additive.

### Modified: simulation.py

_compile_world gets a feature flag (STRANDS_SIM_USE_MJSPEC). When on:
  1. Build MjSpec via SpecBuilder.build()
  2. Stash the spec in _backend_state['spec']
  3. Compile to MjModel
  4. Export _backend_state['xml'] via spec.to_xml() for legacy readers

New helper Simulation._use_mjspec() centralises env-var parsing.
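
A sketch of what centralised env-var parsing might look like. Only the variable name STRANDS_SIM_USE_MJSPEC and the value 1 appear in this commit; the other accepted spellings are an assumption, and use_mjspec is written as a free function for illustration.

```python
import os
from typing import Mapping, Optional

# Assumed truthy spellings; the commit only demonstrates "1".
_TRUTHY = {"1", "true", "yes", "on"}


def use_mjspec(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when the MjSpec code path is enabled. Default is OFF."""
    env = os.environ if environ is None else environ
    return env.get("STRANDS_SIM_USE_MJSPEC", "").strip().lower() in _TRUTHY
```

Passing the environment as a parameter lets tests cover both code paths without mutating os.environ.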

### Modified: test_input_validation.py

Two tests were asserting on raw 'xyaxes="..."' XML strings, which
aren't emitted under the SpecBuilder path (quat= instead). Rewrote
them to assert on cam_mat0 (the compiled rotation matrix) which is
representation-agnostic and passes under BOTH code paths:

  * test_xyaxes_emitted_in_xml         -> test_camera_orientation_written
  * test_different_targets_produce_different_xyaxes
      -> test_different_targets_produce_different_orientations

### New: test_spec_builder.py (+170 LOC, 19 tests)

Locks the SpecBuilder contract:
  * Module-level helpers (_geom_type, _normalize_size, _target_quat)
    with unit coverage including the new ellipsoid support.
  * Parity class: builds the same SimWorld via both paths and asserts
    nbody/ngeom/ncam/nu/njnt/nq/nv match exactly, plus body_pos,
    body_mass, cam_mat0.
  * Bonus: test_ellipsoid_compiles_via_spec_builder - proves the new
    shape that the legacy builder rejects.

### Modified: pyproject.toml

mujoco>=3.0.0 -> mujoco>=3.2.0 (MjSpec API). Current hatch env is at
3.8.0 already; this is just a floor bump.

### Added: IDEA.md

Full staged refactor plan at repo root (from user). Tracks what's
done and what's left.

## Safety

All existing tests pass under BOTH code paths:
  * Default (legacy):   562 passed, 1 skipped
  * STRANDS_SIM_USE_MJSPEC=1: 562 passed, 1 skipped

Including the tests-that-were-string-coupled. Flag default is OFF so
no existing consumer sees any behaviour change.

## Follow-ups (tracked in subsequent issues on PVT_kwDOD151Fs4BSRJP)

  * Stage 3: single-robot attach via spec.attach(robot_spec, ...)
  * Stage 4: multi-robot compose via repeated spec.attach()
  * Stage 5: port scene_ops inject_*/eject_* to spec.recompile(model, data)
  * Stage 6: replace_scene_mjcf / patch_scene_mjcf tool-facing entry points
  * Stage 7: remove feature flag, delete mjcf_builder.py

Complete the string-concat -> MjSpec migration started in ad1d298.
mjcf_builder.py is deleted, scene_ops.py collapses from 980 lines to
307 lines of direct MjSpec AST manipulation, and agents get a new
replace_scene_mjcf escape hatch for raw MJCF.

## What landed

### Stage 3-4: single + multi robot via spec.attach()
- SpecBuilder.attach_robot() composes URDF/MJCF robots via
  mujoco.MjSpec.from_file(...) + scene_spec.attach(robot_spec,
  prefix=..., frame=...). No more hand-rolled name prefixing
  (_prefix_robot_names, ~120 lines) or default-class namespacing
  (_namespace_robot_default_classes, ~60 lines) - MjSpec does it.
- Asset deduplication (meshes/textures/materials) is free via attach().

### Stage 5: live inject/eject via spec.recompile(model, data)
- scene_ops.inject_object_into_scene: one call -
  SpecBuilder.add_object(spec, obj) + spec.recompile(model, data).
- scene_ops.eject_body_from_scene: spec.delete(spec.body(name)) + recompile.
- scene_ops.inject_camera_into_scene, inject_robot_into_scene same shape.
- spec.recompile() preserves qpos/qvel for unchanged joints
  automatically, no manual state-copy loop needed.
- Gone: _patch_xml_paths, _rewrite_mesh_paths, _get_abs_meshdir,
  _save_and_patch_xml, the whole tmpdir+mj_saveLastXML dance.

### Stage 6: agent-authored raw MJCF
- scene_ops.replace_scene_mjcf(world, xml): validates by actually
  calling spec.compile() - returns the MuJoCo compiler error verbatim
  on failure, no process abort.
- Exposed as new tool action replace_scene_mjcf in tool_spec.json.
- Simulation.replace_scene_mjcf() guards against policy-running races
  the same way load_scene / add_robot do.

### Stage 7: cleanup
- Deleted strands_robots/simulation/mujoco/mjcf_builder.py (273 lines).
- Deleted tests/simulation/mujoco/test_mjcf_builder_units.py (190 lines).
- Deleted tests/simulation/mujoco/test_mjcf_xml_injection.py (124 lines)
  - XML injection fuzzing is no longer applicable; MjSpec validates
  names itself.
- scene_ops.py: 980 -> 307 lines (meets the <500-line success criterion
  in IDEA.md).
- test_spec_builder.py rewritten to assert on spec structure +
  compiled MjModel properties, never on exact XML strings.
- New tests/simulation/mujoco/test_replace_scene_mjcf.py covers happy
  path (incl. <tendon> that SimObject can't express), malformed XML,
  semantically invalid MJCF, and the policy-running guard.

## Known constraint

eject_robot_from_scene rebuilds the scene from scratch (drops qpos/qvel
for in-place joints) rather than calling spec.delete() on the attached
robot body. This works around a MuJoCo 3.8 double-free: calling
spec.delete() on a body produced by spec.attach() and then letting the
interpreter shut down crashes with a segfault in the spec destructor.
Documented inline. Not a regression - legacy code path also had to
rebuild for remove_robot.

## Verification

- hatch run test tests/simulation/mujoco/: 415 passed, 1 skipped.
- hatch run lint: ruff + mypy clean across 108 source files.
- grep -r "f'<" strands_robots/simulation/mujoco/: no matches
  (IDEA.md success criterion).
- grep -r "MJCFBuilder\|mjcf_builder" strands_robots/: only
  historical doc comments in spec_builder.py docstring and
  simulation/__init__.py package tree diagram.

The 5 pre-existing failures in tests/tools/test_path_validation.py
are unrelated to this refactor - they fail on base commit too.

## IDEA.md checklist

- [x] mjcf_builder.py deleted
- [x] scene_ops.py under 500 lines (307)
- [x] All existing unit + integration tests pass
- [x] Integration test proves agent can author MJCF with <tendon>
      (unexpressible via SimObject)
- [x] No test asserts on exact XML strings
- [x] grep -r "f'<" strands_robots/simulation/mujoco/ empty

Closes stages 3-7 of IDEA.md. Refs strands-labs#121-strands-labs#126.
…trands-labs#125)

Completes the second half of IDEA.md Stage 6 alongside replace_scene_mjcf.
Where replace_scene_mjcf atomically swaps the whole scene for an
agent-written XML string, patch_scene_mjcf applies a list of small
structured ops to the LIVE spec and recompiles once at the end -
cheaper for surgical edits that don't need full-scene XML.

## Tool surface

New action patch_scene_mjcf with one parameter 'ops', a list of dicts.
Supported op kinds (kept narrow on purpose - a wider surface would be
an arbitrary-code hole; exotic MJCF goes through replace_scene_mjcf):

  {'op': 'add_body',      'parent': 'world', 'name': 'foo', 'pos': [...]}
  {'op': 'add_geom',      'body': 'foo',     'type': 'sphere', 'size': [...]}
  {'op': 'add_site',      'body': 'foo',     'name': 'tip', 'pos': [...]}
  {'op': 'set_body_pos',  'name': 'foo',     'pos': [...]}
  {'op': 'set_body_quat', 'name': 'foo',     'quat': [...]}
  {'op': 'delete_body',   'name': 'foo'}

## Atomicity

The whole batch is applied first, then a single spec.recompile(model,
data) is called. If ANY op raises, the spec is restored from an XML
snapshot taken before the batch started and the original error is
re-raised as ValueError - the world is never left in a half-patched
state.
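
The snapshot-and-restore idea, sketched on a plain dict so it runs without MuJoCo. Here copy.deepcopy stands in for the XML snapshot and the dict stands in for the live spec; apply_patch is a hypothetical name and only three of the op kinds are modelled.

```python
import copy


def apply_patch(scene: dict, ops: list) -> None:
    """Apply all ops or none: on any failure the scene is restored."""
    if not isinstance(ops, list):
        raise ValueError("ops must be a list of dicts")
    snapshot = copy.deepcopy(scene)  # stand-in for the pre-batch XML snapshot
    try:
        for op in ops:
            kind = op["op"]
            if kind == "add_body":
                if op["name"] in scene:
                    raise KeyError(f"body {op['name']!r} already exists")
                scene[op["name"]] = {"pos": list(op.get("pos", [0, 0, 0]))}
            elif kind == "set_body_pos":
                scene[op["name"]]["pos"] = list(op["pos"])
            elif kind == "delete_body":
                del scene[op["name"]]
            else:
                raise ValueError(f"unsupported op: {kind!r}")
        # The real implementation recompiles once here, after the whole batch.
    except Exception as exc:
        scene.clear()
        scene.update(snapshot)  # roll back to the pre-batch state
        raise ValueError(f"patch failed, scene restored: {exc}") from exc
```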

## Implementation notes

Two non-obvious MjSpec 3.8 behaviours addressed:

1. spec.body(name) only resolves bodies that existed at the last
   compile() / recompile(). A body added mid-batch is NOT visible
   through that lookup. Fix: track handles by name in a batch-local
   dict (new_bodies) as we create them, plus a fallback scan over
   spec.bodies for bodies introduced via spec.attach() outside the
   current batch. See _find_body() docstring.

2. MjsBody has no .delete() method - deletion is a spec-level
   operation: spec.delete(body). Caught by the test suite.

## Tests

New tests/simulation/mujoco/test_patch_scene_mjcf.py with 13 cases:
  * happy path: add_body+add_geom, set_body_pos, delete_body,
    add_site, empty ops is no-op.
  * atomic rollback: failed op in middle of batch leaves earlier
    bodies out of the compiled model; missing required field is
    rejected.
  * error paths: no world, non-list input, blocked during policy.
  * tool_spec: action + ops parameter advertised in the schema.
  * state preservation: a post-patch step() still works.

## Totals

- scene_ops.py: 307 -> 481 lines (still under the 500-line
  success criterion from IDEA.md).
- tests: 415 -> 428 passing (+13, one still skipped).
- lint: ruff + mypy clean across 109 source files.

Closes GH strands-labs#125 stage 6 (agent-authored raw MJCF).
cagataycali and others added 3 commits May 5, 2026 02:45
The ==== and ---- divider comments were cosmetic scaffolding left over
from the initial patch_scene_mjcf + MjSpec refactor commits. They
bloat the diff without helping readers navigate - ruff/editors
already fold functions cleanly and the docstrings carry the section
headings.

Removed 28 banner lines across:
  - strands_robots/simulation/mujoco/scene_ops.py (8)
  - strands_robots/simulation/mujoco/spec_builder.py (2)
  - tests/simulation/mujoco/test_replace_scene_mjcf.py (8)
  - tests/simulation/mujoco/test_spec_builder.py (10)
  - tests_integ/simulation/test_mujoco_journeys.py (reformatted by
    ruff after comment removal)

Verified:
  - hatch run test tests/simulation/mujoco/: 428 passed, 1 skipped
  - hatch run lint: ruff + mypy clean across 109 source files
  - /tmp/mjspec-devx-research/devx_research.py: 29/29 probes pass
    (new DevX probe deck, mirrors /tmp/agent-devx-research pattern)

No functional changes.
… headless CI

The _can_render() probe used to call mj.Renderer() directly in-process.
On CI environments where libEGL.so.1 is loadable but non-functional
(e.g. no GPU driver), this triggers a C-level abort (SIGABRT) that kills
the entire test process before Python can catch the error.

Fix: run the rendering probe in a subprocess. If the child dies (SIGABRT,
exit code != 0, timeout), _can_render() returns False and rendering tests
are cleanly skipped via @requires_gl.
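
A sketch of the subprocess probe, assuming a helper named probe_ok (hypothetical) and a probe snippet that builds an empty model and a renderer. The actual snippet and timeout used by the test suite are not shown in this commit.

```python
import subprocess
import sys

# Child-process snippet: a C-level abort here kills only the child.
_RENDER_PROBE = (
    "import mujoco\n"
    "model = mujoco.MjModel.from_xml_string('<mujoco/>')\n"
    "mujoco.Renderer(model).close()\n"
)


def probe_ok(snippet: str, timeout: float = 30.0) -> bool:
    """Run snippet in a fresh interpreter; True only on a clean exit."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", snippet],
            capture_output=True,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    # A signal death (e.g. SIGABRT) shows up as a non-zero return code.
    return proc.returncode == 0


def can_render() -> bool:
    return probe_ok(_RENDER_PROBE)
```

The parent process only ever inspects an exit status, so a broken GL stack costs one skipped test class instead of the whole run.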

Also adds @requires_gl to test classes that exercise rendering:
- TestRenderCameraValidation (test_input_validation.py)
- TestRendererTLSCache (test_renderer_hygiene.py)
- TestCamerasRecordingWithoutLerobot (test_recording_backends.py)
- test_render_* (test_error_paths.py)
…ene_mjcf

## The bug

Before this fix:

    >>> sim.create_world()
    >>> sim.replace_scene_mjcf('<mujoco>...</mujoco>')
    >>> sim.export_xml()
    FatalError: No XML model loaded

export_xml() called mj.mj_saveLastXML() which relies on MuJoCo's internal
'last loaded XML' cache. That cache is populated only when the model is
loaded from XML via mj_loadXML / MjModel.from_xml_*. After the
MjSpec-based replace_scene_mjcf (which compiles from an MjSpec instead),
the cache is empty and mj_saveLastXML raises a C-level FatalError.

The same bug applies to patch_scene_mjcf - recompiling a spec doesn't
populate MuJoCo's 'last XML' cache either.

## How it was found

Agent-in-the-loop probe at /tmp/e2e_agentic_test_85/notebooks/e2e_agentic_test_85.ipynb
scenario S2_equality: the LLM naturally called export_xml after
replace_scene_mjcf (to verify what it had installed) and the exception
bubbled to the agent as an unhandled tool error. Programmatic tests don't
catch this because they don't mix replace_scene_mjcf with export_xml.

## The fix

Prefer spec.to_xml() when world._backend_state['spec'] is present - the
MjSpec is always canonical. Fall back to mj_saveLastXML only when no
spec is tracked (legacy load_scene paths that bypass SpecBuilder).

Also: file-write path now uses plain file I/O (open + write) instead of
mj_saveLastXML so the --output-path flow works uniformly.

## Tests

New tests/simulation/mujoco/test_export_xml_after_replace.py with 4 cases:
  * export after replace_scene_mjcf returns the new scene's XML
  * export after patch_scene_mjcf returns the patched scene's XML
  * export_xml with an output path writes a file containing the patched scene
  * export before create_world still errors cleanly (unchanged)

## Totals

- hatch run test tests/simulation/mujoco/: 432 passed, 1 skipped (was 428)
- hatch run lint: ruff + mypy clean across 110 source files
- /tmp/e2e_agentic_test_85 rerun: 5/5 scenarios will now pass cleanly

Labels

None yet

Projects

Status: In review

Development

Successfully merging this pull request may close these issues.

4 participants